This data contains the top 1000 movies by their user scores on IMDb. It’s available here on Kaggle.

Some questions for analysis after browsing the data:

library(tidyverse)
library(knitr)
library(kableExtra)
library(plotly)
library(forcats)
library(png)
library(ggimage)
library(grid)

Importing, inspecting, and a bit of data cleaning

imdb_df <- read.csv("/Users/taylorparchment/Downloads/imdb_top_1000.csv")
glimpse(imdb_df)
## Rows: 1,000
## Columns: 16
## $ Poster_Link   <chr> "https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmN…
## $ Series_Title  <chr> "The Shawshank Redemption", "The Godfather", "The Dark K…
## $ Released_Year <chr> "1994", "1972", "2008", "1974", "1957", "2003", "1994", …
## $ Certificate   <chr> "A", "A", "UA", "A", "U", "U", "A", "A", "UA", "A", "U",…
## $ Runtime       <chr> "142 min", "175 min", "152 min", "202 min", "96 min", "2…
## $ Genre         <chr> "Drama", "Crime, Drama", "Action, Crime, Drama", "Crime,…
## $ IMDB_Rating   <dbl> 9.3, 9.2, 9.0, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8…
## $ Overview      <chr> "Two imprisoned men bond over a number of years, finding…
## $ Meta_score    <int> 80, 100, 84, 90, 96, 94, 94, 94, 74, 66, 92, 82, 90, 87,…
## $ Director      <chr> "Frank Darabont", "Francis Ford Coppola", "Christopher N…
## $ Star1         <chr> "Tim Robbins", "Marlon Brando", "Christian Bale", "Al Pa…
## $ Star2         <chr> "Morgan Freeman", "Al Pacino", "Heath Ledger", "Robert D…
## $ Star3         <chr> "Bob Gunton", "James Caan", "Aaron Eckhart", "Robert Duv…
## $ Star4         <chr> "William Sadler", "Diane Keaton", "Michael Caine", "Dian…
## $ No_of_Votes   <int> 2343110, 1620367, 2303232, 1129952, 689845, 1642758, 182…
## $ Gross         <chr> "28,341,469", "134,966,411", "534,858,444", "57,300,000"…
colSums(is.na(imdb_df))
##   Poster_Link  Series_Title Released_Year   Certificate       Runtime 
##             0             0             0             0             0 
##         Genre   IMDB_Rating      Overview    Meta_score      Director 
##             0             0             0           157             0 
##         Star1         Star2         Star3         Star4   No_of_Votes 
##             0             0             0             0             0 
##         Gross 
##             0

The “Gross” column is a character vector, so it needs to be converted to a numeric type. It also contains blank strings, which is.na() does not count as missing values (hence the 0 reported for Gross above).

# Convert Gross from comma-formatted strings to numbers (blank strings become NA)
imdb_df <- imdb_df %>% 
  mutate(Gross = parse_number(Gross))

# Count the missing values in Gross after parsing
sum(is.na(imdb_df$Gross))
## [1] 169
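
The blank-string issue can be reproduced on a few values. The sketch below is a base R illustration, not the code used above: it takes two Gross strings from the glimpse() output plus one blank string (added here as an assumed example of a missing entry) and shows that is.na() misses the blank until the column is converted to numbers.

```r
# Two Gross values from the glimpse() output, plus a blank string standing in
# for a movie with no recorded gross (the blank entry is an assumed example)
gross_raw <- c("28,341,469", "134,966,411", "")

# Blank strings are not NA, so no missing values are detected yet
missing_before <- sum(is.na(gross_raw))

# Strip the thousands separators and convert; "" becomes NA
gross_num <- as.numeric(gsub(",", "", gross_raw))
missing_after <- sum(is.na(gross_num))

print(c(before = missing_before, after = missing_after))
```

Another option would be to pass na.strings = c("", "NA") to read.csv() so that blanks are treated as missing at import time.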

Additionally, I later noticed that “Joe Russo” is listed as “Star1” several times, but it appears he was a co-director of those movies, not an actor. For simplicity’s sake, I’ll set these cells to NA so he doesn’t appear in any charts as an actor.

imdb_df$Star1[imdb_df$Star1 == "Joe Russo"] <- NA

Comparing scores

# Top movies by user score
top_user <- imdb_df %>% 
  arrange(desc(IMDB_Rating)) %>% 
  select(Series_Title, Released_Year, IMDB_Rating, Meta_score, Gross)

# Top movies by metascore
top_meta <- imdb_df %>% 
  arrange(desc(Meta_score)) %>% 
  select(Series_Title, Released_Year, Meta_score, IMDB_Rating, Gross)

Top 20 Movies by User Score

knitr::kable(head(top_user, 20))
Series_Title Released_Year IMDB_Rating Meta_score Gross
The Shawshank Redemption 1994 9.3 80 28341469
The Godfather 1972 9.2 100 134966411
The Dark Knight 2008 9.0 84 534858444
The Godfather: Part II 1974 9.0 90 57300000
12 Angry Men 1957 9.0 96 4360000
The Lord of the Rings: The Return of the King 2003 8.9 94 377845905
Pulp Fiction 1994 8.9 94 107928762
Schindler’s List 1993 8.9 94 96898818
Inception 2010 8.8 74 292576195
Fight Club 1999 8.8 66 37030102
The Lord of the Rings: The Fellowship of the Ring 2001 8.8 92 315544750
Forrest Gump 1994 8.8 82 330252182
Il buono, il brutto, il cattivo 1966 8.8 90 6100000
The Lord of the Rings: The Two Towers 2002 8.7 87 342551365
The Matrix 1999 8.7 73 171479930
Goodfellas 1990 8.7 90 46836394
Star Wars: Episode V - The Empire Strikes Back 1980 8.7 82 290475067
One Flew Over the Cuckoo’s Nest 1975 8.7 83 112000000
Hamilton 2020 8.6 90 NA
Gisaengchung 2019 8.6 96 53367844

Top 20 Movies by Metascore

knitr::kable(head(top_meta, 20))
Series_Title Released_Year Meta_score IMDB_Rating Gross
The Godfather 1972 100 9.2 134966411
Casablanca 1942 100 8.5 1024560
Rear Window 1954 100 8.4 36764313
Lawrence of Arabia 1962 100 8.3 44824144
Vertigo 1958 100 8.3 3200000
Citizen Kane 1941 100 8.3 1585634
Trois couleurs: Rouge 1994 100 8.1 4043686
Fanny och Alexander 1982 100 8.1 4971340
Il conformista 1970 100 8.0 541940
Sweet Smell of Success 1957 100 8.0 NA
Boyhood 2014 100 7.9 25379975
Notorious 1946 100 7.9 10464000
City Lights 1931 99 8.5 19181
Singin’ in the Rain 1952 99 8.3 8819028
Touch of Evil 1958 99 8.0 2237659
The Night of the Hunter 1955 99 8.0 654000
Shichinin no samurai 1954 98 8.6 269061
North by Northwest 1959 98 8.3 13275000
Metropolis 1927 98 8.3 1236166
Pan’s Labyrinth 2006 98 8.2 37634615

Interestingly, browsing through the top 20 movies by each score, it appears that the movies valued by average viewers and by critics are quite different. Among the top metascore movies in particular, some have noticeably lower user scores. I want to know which movie has the biggest difference between the two groups.

Finding biggest critic and user score disparity

score_disparity <- imdb_df %>% 
  mutate(score_diff = abs(IMDB_Rating * 10 - Meta_score)) %>% 
  select(Series_Title, Released_Year, score_diff, IMDB_Rating, Meta_score, Genre) %>% 
  arrange(desc(score_diff)) %>% 
  head(10)
knitr::kable(score_disparity)
Series_Title Released_Year score_diff IMDB_Rating Meta_score Genre
I Am Sam 2001 49 7.7 28 Drama
Tropa de Elite 2007 47 8.0 33 Action, Crime, Drama
The Butterfly Effect 2004 46 7.6 30 Drama, Sci-Fi, Thriller
Seven Pounds 2008 40 7.6 36 Drama
Kai po che! 2013 37 7.7 40 Drama, Sport
Fear and Loathing in Las Vegas 1998 35 7.6 41 Adventure, Comedy, Drama
Pink Floyd: The Wall 1982 34 8.1 47 Drama, Fantasy, Music
The Boondock Saints 1999 34 7.8 44 Action, Crime, Thriller
Bound by Honor 1993 33 8.0 47 Crime, Drama
Predator 1987 33 7.8 45 Action, Adventure, Sci-Fi

The movie with the biggest difference between user and critic scores is the 2001 drama I Am Sam. All ten movies here were rated much higher by users than by critics, and eight of them list drama as at least one genre.
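
The score_diff calculation can be checked by hand. The sketch below uses three rows copied from the table above and also keeps the sign of the difference (the signed_diff name is introduced here for illustration only), which makes the direction of each disagreement explicit.

```r
# Three rows copied from the disparity table above
scores <- data.frame(
  Series_Title = c("I Am Sam", "Tropa de Elite", "The Butterfly Effect"),
  IMDB_Rating  = c(7.7, 8.0, 7.6),
  Meta_score   = c(28, 33, 30)
)

# Put the 0-10 user rating on the 0-100 metascore scale before subtracting
scores$signed_diff <- scores$IMDB_Rating * 10 - scores$Meta_score  # > 0: users rated higher
scores$score_diff  <- abs(scores$signed_diff)

print(scores)
```

Because abs() discards the sign, a table sorted on score_diff alone cannot tell us which group liked a movie more; that has to be read off the two score columns.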

This tells us which movies users liked and critics did not, but I’d like to see the other direction too.

# Look only for movies critics liked
score_disparity2 <- imdb_df %>% 
  mutate(score_diff = Meta_score - IMDB_Rating * 10) %>% 
  select(Series_Title, Released_Year, score_diff, IMDB_Rating, Meta_score, Genre) %>% 
  arrange(desc(score_diff)) %>% 
  head(10)
knitr::kable(score_disparity2)
Series_Title Released_Year score_diff IMDB_Rating Meta_score Genre
Boyhood 2014 21 7.9 100 Drama
Notorious 1946 21 7.9 100 Drama, Film-Noir, Romance
Il conformista 1970 20 8.0 100 Drama
Sweet Smell of Success 1957 20 8.0 100 Drama, Film-Noir
The Lady Vanishes 1938 20 7.8 98 Mystery, Thriller
A Hard Day’s Night 1964 20 7.6 96 Comedy, Music, Musical
Trois couleurs: Rouge 1994 19 8.1 100 Drama, Mystery, Romance
Fanny och Alexander 1982 19 8.1 100 Drama
Touch of Evil 1958 19 8.0 99 Crime, Drama, Film-Noir
The Night of the Hunter 1955 19 8.0 99 Crime, Drama, Film-Noir

The score differences here are much smaller. This is probably because the dataset was selected by user score, so user scores are guaranteed not to be very low, whereas critic scores can take any value.
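
A rough bound makes this asymmetry concrete. The sketch assumes that 7.6, the lowest user rating that appears in the tables above, is about the floor for this dataset; the variable names are illustrative.

```r
# Assumption: 7.6 is the lowest user rating in the data (lowest seen in the tables above)
rating_floor <- 7.6
meta_max     <- 100   # highest possible metascore

# Largest possible gap in favour of critics: perfect metascore, floor user rating
max_critic_lead <- meta_max - rating_floor * 10

# Largest gap in favour of users seen above (I Am Sam: 7.7 vs. 28)
observed_user_lead <- 7.7 * 10 - 28

print(c(max_critic_lead = max_critic_lead, observed_user_lead = observed_user_lead))
```

Under that assumption critics can lead users by at most 24 points, which is consistent with the largest gap of 21 in the table, while nothing stops users from leading critics by far more.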

Top grossing movies vs. scores

I’ll get an idea of the highest-earning movies.

# Find the top 100 top grossing movies
top_grossing <- imdb_df %>% 
  arrange(desc(Gross)) %>% 
  select(Series_Title, Released_Year, IMDB_Rating, Meta_score, Gross) %>% 
  head(100)
knitr::kable(head(top_grossing, 10))
Series_Title Released_Year IMDB_Rating Meta_score Gross
Star Wars: Episode VII - The Force Awakens 2015 7.9 80 936662225
Avengers: Endgame 2019 8.4 78 858373000
Avatar 2009 7.8 83 760507625
Avengers: Infinity War 2018 8.4 68 678815482
Titanic 1997 7.8 75 659325379
The Avengers 2012 8.0 69 623279547
Incredibles 2 2018 7.6 80 608581744
The Dark Knight 2008 9.0 84 534858444
Rogue One 2016 7.8 65 532177324
The Dark Knight Rises 2012 8.4 78 448139099

First, I want to see whether how much a movie earns is indicative of how well average viewers like it.

# Top grossing movies vs user scores
# Graph a scatter plot and line of best fit
gross_vs_user <- ggplot(top_grossing, aes(x = Gross, y = IMDB_Rating, text = Series_Title)) +
  geom_point() +
  geom_smooth(aes(group=-1), method="lm", se = FALSE) +
  scale_x_continuous(limits = c(0, 1e9), breaks = seq(0, 1e9, by = 100000000), labels = c("0", "100M", "200M", "300M", "400M", "500M", "600M", "700M", "800M", "900M", "1B")) +
  scale_y_continuous(limits = c(7.5, 9.2)) +
  labs(title = "Top Grossing Movies vs. User Score",
       y = "User Score",
       x = "Gross Profit in Dollars") + theme_minimal() +
  coord_cartesian(xlim = c(0, 1e9), ylim = c(7.5, 9.2))

ggplotly(gross_vs_user)

I’ll try the same thing with metascore.

gross_vs_meta <- ggplot(top_grossing, aes(x = Gross, y = Meta_score, text = Series_Title)) +
  geom_point() + 
  geom_smooth(aes(group=-1), method="lm", se = FALSE) +
  scale_x_continuous(limits = c(0, 1e9), breaks = seq(0, 1e9, by = 100000000), labels = c("0", "100M", "200M", "300M", "400M", "500M", "600M", "700M", "800M", "900M", "1B")) +
  scale_y_continuous(limits = c(50, 100)) +
  labs(title = "Top Grossing Movies vs. Metascore",
       y = "Metascore",
       x = "Gross Profit in Dollars") + theme_minimal()
ggplotly(gross_vs_meta)

Individual top-grossing movies often receive different scores from average viewers and from critics, but the overall spread of the two score types is similar, with the metascore spread a bit larger. Both show a very weak relationship, if any, to gross profit.
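
One way to put a number on "very weak" is a Pearson correlation. The sketch below is self-contained and therefore uses only the ten rows printed in the table above, not the full top 100, so its output is illustrative rather than the result for the plotted data. The top10 data frame is a name introduced here.

```r
# The ten highest-grossing movies printed above (not the full top 100)
top10 <- data.frame(
  Gross = c(936662225, 858373000, 760507625, 678815482, 659325379,
            623279547, 608581744, 534858444, 532177324, 448139099),
  IMDB_Rating = c(7.9, 8.4, 7.8, 8.4, 7.8, 8.0, 7.6, 9.0, 7.8, 8.4),
  Meta_score  = c(80, 78, 83, 68, 75, 69, 80, 84, 65, 78)
)

# Pearson correlation of gross with each score type
cor_user <- cor(top10$Gross, top10$IMDB_Rating)
cor_meta <- cor(top10$Gross, top10$Meta_score)

print(round(c(user = cor_user, meta = cor_meta), 2))
```

On the real data the same call would be cor(top_grossing$Gross, top_grossing$IMDB_Rating). For Meta_score, which may contain missing values, adding use = "complete.obs" keeps the result from being NA.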

It seems like the reverse of this relationship would be stronger. I would expect user or critical scores to be more indicative of profit.

Top scores vs. gross profit

user_vs_gross <- ggplot(head(top_user, 100), aes(x = IMDB_Rating, y = Gross, text = Series_Title)) +
  geom_point() +
  geom_smooth(aes(group=-1), method="lm", se = FALSE) +
  labs(title = "Top 100 IMDb Movies by User Score vs. Gross Profit",
       y = "Gross Profit in Dollars",
       x = "User Score") + 
  scale_y_continuous(breaks = seq(0, 1e9, by = 100000000), labels = c("0", "100M", "200M", "300M", "400M", "500M", "600M", "700M", "800M", "900M", "1B")) + theme_minimal()
ggplotly(user_vs_gross)

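The relationship between them is still less obvious than I would have expected. Let’s look into it with a linear model. Note that the model below is fit on every movie in top_user with a recorded gross, not just the 100 plotted above.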

summary(lm(formula = Gross ~ IMDB_Rating, data = top_user))
## 
## Call:
## lm(formula = Gross ~ IMDB_Rating, data = top_user)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -102820436  -62517252  -41910193   17363997  870372054 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -227376280  106535840  -2.134  0.03311 * 
## IMDB_Rating   37172968   13397415   2.775  0.00565 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109300000 on 829 degrees of freedom
##   (169 observations deleted due to missingness)
## Multiple R-squared:  0.009201,   Adjusted R-squared:  0.008006 
## F-statistic: 7.699 on 1 and 829 DF,  p-value: 0.005651

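With a p-value of 0.005651, the positive slope is statistically significant at the 0.01 level. The effect is very weak, though: with an R-squared of 0.009201, user rating explains less than 1% of the variation in gross.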
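
To get a feel for the size of this effect, the coefficients can be plugged in by hand. The sketch below copies the numbers from the summary() output above; the variable names are introduced here for illustration.

```r
# Values copied from the summary() output above
intercept <- -227376280
slope     <- 37172968    # change in predicted gross per rating point
resid_se  <- 109300000   # residual standard error
r_squared <- 0.009201

# Predicted gross for a movie rated 8.0 and for one rated 9.0
ratings   <- c(8.0, 9.0)
predicted <- intercept + slope * ratings

# One rating point of predicted gross as a fraction of the typical residual
slope_vs_noise <- slope / resid_se

# Correlation implied by R-squared (simple regression with a positive slope)
implied_cor <- sqrt(r_squared)

print(predicted)
print(slope_vs_noise)
print(implied_cor)
```

A full rating point moves the predicted gross by about 37 million dollars, roughly a third of the residual standard error, so the rating of an individual movie says little about what it earned.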

meta_vs_gross <- ggplot(head(top_meta, 100), aes(x = Meta_score, y = Gross, text = Series_Title)) +
  geom_point() +
  geom_smooth(aes(group=-1), method="lm", se = FALSE) +
  labs(title = "Top 100 IMDb Movies by Metascore vs. Gross Profit",
       y = "Gross Profit in Dollars",
       x = "Metascore") + 
  scale_y_continuous(breaks = seq(0, 1e9, by = 100000000), labels = c("0", "100M", "200M", "300M", "400M", "500M", "600M", "700M", "800M", "900M", "1B")) + theme_minimal()
ggplotly(meta_vs_gross)

The connection between metascore and profit looks even weaker. It seems implausible that a higher metascore would actually reduce profits, so I’ll check this too.

summary(lm(formula = Gross ~ Meta_score, data = top_meta))
## 
## Call:
## lm(formula = Gross ~ Meta_score, data = top_meta)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -87279147 -68401159 -43159626  23025583 862414862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 96442843   26009397   3.708 0.000224 ***
## Meta_score   -277444     331500  -0.837 0.402897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113400000 on 748 degrees of freedom
##   (250 observations deleted due to missingness)
## Multiple R-squared:  0.0009356,  Adjusted R-squared:  -0.0004001 
## F-statistic: 0.7005 on 1 and 748 DF,  p-value: 0.4029

With a p-value of 0.4029, this relationship is not statistically significant.

Thinking about this sensibly, it’s reasonable that a movie the average viewer likes would drive profit much more than one with a good critical response only. The result is also very likely affected by the bias of the dataset: this is a set of the top 1000 movies by user rating, so it is heavily skewed towards movies that users rated highly. A dataset of the highest-scoring movies by critical review could be more informative about the relationship between metascore and profit.

Top grossing actors and directors

Next, I’ll look at the actors and directors whose movies have the highest total gross.

top_grossing_actors <- imdb_df %>% 
  pivot_longer(cols = starts_with("Star"), values_to = "Actor") %>%   # Get each actor in one row
  filter(!is.na(Actor)) %>% 
  group_by(Actor) %>% 
  summarize(Total_Gross = sum(Gross, na.rm = TRUE), Gross_Per_Movie = mean(Gross, na.rm = TRUE)) %>% 
  arrange(desc(Total_Gross))

# Director is already one value per row, so no reshaping is needed
top_grossing_directors <- imdb_df %>% 
  group_by(Director) %>% 
  summarize(Total_Gross = sum(Gross, na.rm = TRUE), Gross_Per_Movie = mean(Gross, na.rm = TRUE)) %>% 
  arrange(desc(Total_Gross))

# Highest gross actor chart
ggplot(head(top_grossing_actors, 10), aes(x = fct_reorder(Actor, Total_Gross), y = Total_Gross)) +
  geom_col() +
  labs(title = "Total Gross Profit of Actors' Movies",
       x = "Actor",
       y = "Total Gross Movie Profit") + 
  scale_y_continuous(breaks = seq(0, 3e9, by = 500000000), labels = c("0", "500M", "1B", "1.5B", "2B", "2.5B", "3B")) + coord_flip() + theme_minimal()

# Highest gross director chart
ggplot(head(top_grossing_directors, 10), aes(x = fct_reorder(Director, Total_Gross), y = Total_Gross)) +
  geom_col() +
  labs(title = "Total Gross Profit of Directors' Movies",
       x = "Director",
       y = "Total Gross Movie Profit") + 
  scale_y_continuous(breaks = seq(0, 3e9, by = 500000000), labels = c("0", "500M", "1B", "1.5B", "2B", "2.5B", "3B")) + coord_flip() + theme_minimal()
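
The reshaping step behind the actor totals is easier to see on a small input. The sketch below rebuilds the first three movies from the glimpse() output as a toy data frame (the movies object is introduced here for illustration) and applies the same pivot_longer() and summarize() pattern.

```r
library(dplyr)
library(tidyr)

# First three movies from the glimpse() output above
movies <- tibble(
  Series_Title = c("The Shawshank Redemption", "The Godfather", "The Dark Knight"),
  Star1 = c("Tim Robbins", "Marlon Brando", "Christian Bale"),
  Star2 = c("Morgan Freeman", "Al Pacino", "Heath Ledger"),
  Star3 = c("Bob Gunton", "James Caan", "Aaron Eckhart"),
  Star4 = c("William Sadler", "Diane Keaton", "Michael Caine"),
  Gross = c(28341469, 134966411, 534858444)
)

# One row per (movie, actor): 3 movies x 4 star columns = 12 rows
actors_long <- movies %>%
  pivot_longer(cols = starts_with("Star"), names_to = NULL, values_to = "Actor")

# Each movie's gross is credited to every one of its four listed stars
actor_totals <- actors_long %>%
  group_by(Actor) %>%
  summarize(Total_Gross = sum(Gross, na.rm = TRUE)) %>%
  arrange(desc(Total_Gross))

print(actor_totals)
```

Because a movie's full gross is credited to each of its four stars, Total_Gross measures the combined gross of the movies an actor appears in, not an individual share of the earnings.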

Investigating movie genres

Next, I want to see which movie genre is the most represented in the top thousand movies.

# Make data longer by separating Genre column, count totals
total_genre_count <- imdb_df %>% 
  separate_rows(Genre, sep = ",\\s*") %>% 
  group_by(Genre) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))

ggplot(total_genre_count, aes(x = fct_reorder(Genre, count), y = count)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(title = "Total Genre Count",
       x = "Genre",
       y = "Total Number of Movies")
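
What the separator pattern does can be shown with a base R analogue of the separate_rows() step. This is a sketch for illustration, not the code used for the chart; the genre strings are those of the first three movies in the glimpse() output.

```r
# Genre strings of the first three movies in the glimpse() output above
genre <- c("Drama", "Crime, Drama", "Action, Crime, Drama")

# Split on a comma followed by any amount of whitespace
genre_list <- strsplit(genre, ",\\s*")

# Flatten and count: a movie is counted once under each of its genres
genre_count <- sort(table(unlist(genre_list)), decreasing = TRUE)

print(genre_count)
```

Since a movie is counted once per listed genre, the counts add up to more than the number of movies.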

I’m also curious what some of the most prolific actors’ and directors’ favorite genres are. I’ll see which genres they are listed under most often.

# Get each actor's total number of appearances in each genre
# Lengthen data by genre and actor, drop missing actors, then group by actor and genre
actor_genre_count <- imdb_df %>% 
  separate_rows(Genre, sep = ",\\s*") %>% 
  pivot_longer(cols = starts_with("Star"), names_to = NULL, values_to = "Actor") %>% 
  filter(!is.na(Actor)) %>%   # Star1 was set to NA for the Joe Russo rows
  group_by(Actor, Genre) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))

# Get the director's total genres too
director_genre_count <- imdb_df %>% 
  separate_rows(Genre, sep = ",\\s*") %>% 
  group_by(Director, Genre) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
knitr::kable(head(actor_genre_count, 20))
Actor Genre count
Robert De Niro Drama 17
Al Pacino Drama 13
Robert De Niro Crime 12
Al Pacino Crime 11
Brad Pitt Drama 9
Christian Bale Drama 9
Denzel Washington Drama 9
Ethan Hawke Drama 9
Leonardo DiCaprio Drama 9
Tom Hanks Drama 9
Harrison Ford Action 8
Johnny Depp Drama 8
Aamir Khan Drama 7
Ian McKellen Adventure 7
Jake Gyllenhaal Drama 7
James Stewart Drama 7
Morgan Freeman Drama 7
Russell Crowe Drama 7
Tom Hanks Adventure 7
Bill Murray Comedy 6
knitr::kable(head(director_genre_count, 20))
Director Genre count
Hayao Miyazaki Animation 11
Akira Kurosawa Drama 9
Alfred Hitchcock Mystery 9
Alfred Hitchcock Thriller 9
Hayao Miyazaki Adventure 9
Martin Scorsese Drama 9
Billy Wilder Drama 8
David Fincher Drama 8
Martin Scorsese Crime 8
Woody Allen Comedy 8
Clint Eastwood Drama 7
Ingmar Bergman Drama 7
Quentin Tarantino Drama 7
Stanley Kubrick Drama 7
Steven Spielberg Drama 7
Charles Chaplin Comedy 6
Wes Anderson Comedy 6
Alfonso Cuarón Drama 5
Alfred Hitchcock Drama 5
Andrei Tarkovsky Drama 5

Conclusions

I think most people know that audiences and critics tend to value different movies, but it’s interesting to see that the top audience and critic favorites are almost entirely different, and that some movies are especially divisive. Dramas that users like often receive much lower critic scores, while older films, film noir in particular, are rated extremely highly by critics but only moderately well by general audiences.

When it comes to profit and scores, a movie’s gross tells us little about how viewers rated it. Higher user scores are associated with higher gross, and the association is statistically significant, but it is very weak (R-squared below 0.01). Critic scores show no significant relationship with gross at all.

Among the actors whose movies have the highest total gross, we see many from the Marvel franchise (Robert Downey Jr., Chris Evans, Mark Ruffalo) and the Harry Potter series (Daniel Radcliffe, Rupert Grint).

As for genres, drama is by far the most commonly listed. Of course, movies can be listed under several genres, and it’s pretty hard to have a movie without some kind of drama, so this might not be very informative. Looking at actors and directors, some that stand out are Robert De Niro for his number of drama and crime listings, and Hayao Miyazaki for his 11 animation listings.

Finally, I’ll make a nice visual of the two most controversial movies of the list.

boyhood <- "/Users/taylorparchment/Desktop/Boyhood_(2014).png"
iamsam <- "/Users/taylorparchment/Desktop/IAmSam.png"

# filter two controversial titles
# get their two score types on different rows
controversial <- imdb_df %>% 
  filter(Series_Title == "Boyhood" | Series_Title == "I Am Sam") %>% 
  mutate(Viewers = IMDB_Rating * 10,
         Critics = Meta_score) %>% 
  pivot_longer(cols = c(Viewers, Critics), names_to = "Score_Type", values_to = "Score") 


movie_bar_chart <- ggplot(controversial, aes(x = Series_Title, y = Score, fill = Score_Type)) + 
  
  # side by side bars
  geom_col(position = position_dodge()) +
  
  #change visual details
  theme_transparent() + 
  coord_cartesian(ylim = c(0, max(controversial$Score) + 60)) +
  scale_y_continuous(breaks=c(25, 50, 75, 100)) +
  
  # display and format labels
  labs(title = "Viewers vs Critics",
       subtitle = "Most controversial movies between viewers and critics of IMDb's top 1000",
       fill = "") +
  theme(legend.position = "bottom",
    plot.title = element_text(face = "bold",
                              size = 17, hjust = 0.5)) +
  
  # add score value labels on the bars
  geom_text(
    aes(label = Score),
    color = "white", size = 3,
    vjust = 2, position = position_dodge(.9), fontface ="bold") +
  
  # add Boyhood text
  geom_label(aes(x= 0.41, y = 150), label = "2014 experimental film \nfollowing a boy's real \nadolescence", size = 3, hjust = 0, vjust = 0.5, show.legend = FALSE) +
  geom_label(aes(x= 0.41, y = 134), label = "Won Oscar for best\nsupporting performance", size = 3, hjust = 0, vjust = 0.5, show.legend = FALSE) +
  geom_label(aes(x= 0.41, y = 118), label = 'Hailed as "epic in scope",\n"astonishing achievement"\nby critics', size = 3, hjust = 0, vjust = 0.5, show.legend = FALSE) +
  
  # add I am Sam text
  geom_label(aes(x= 1.99, y = 150), label = "2001 drama about a\ndisabled man's fight for\ncustody of daughter", size = 3, hjust = 0, vjust = 0.5, show.legend = FALSE, fill = "lightblue") +
  geom_label(aes(x= 1.99, y = 134), label = 'Described as "powerful",\n"heartwarming" by viewers' , size = 3, hjust = 0, vjust = 0.5, show.legend = FALSE, fill = "lightblue") +
  geom_label(aes(x= 1.99, y = 118), label = 'Critics say "contrived",\n"insensitive", "shamelessly\nsentimental"' , size = 3, hjust = 0, vjust = 0.5, show.legend = FALSE, fill = "lightblue") +
  
  # remove x-axis label
  xlab("") +
  
  # display movie images
  geom_image(
    aes(image = boyhood), x = 1.27, y =135, size = 0.3, by = "height") +
  
  geom_image(
    aes(image = iamsam), x = 1.73, y =135, size = 0.3, by = "height") 
print(movie_bar_chart)

# ggsave("/Users/taylorparchment/Desktop/imdb_chart.png", movie_bar_chart, width = 6, height = 7, bg = "white")